Small Language Models

An under-appreciated trend bubbling under the surface is the number of techies working on “small language models”. These are models that are almost as good as the Big LLMs but use a fraction of the resources, both in training and inference. Microsoft started it late last year with their “Phi” models, and Apple accelerated the trend with their announcements of models that will run entirely on-device. Even Sam “Give me $7 Trillion” Altman admits that existing LLM architectures have run out of gas and will require smarter techniques. A lot of GPU infrastructure is being built on the assumption that “the more the better”, but what if that’s not (quite) true? What if having the biggest, baddest model costs orders of magnitude more but gets you only a barely noticeable improvement in quality?

IEEE Spectrum 20-Jun-2024:

Apple, Microsoft Shrink AI Models to Improve Them: “Small language models” emerge as an alternative to gargantuan AI options

For researchers like Alex Warstadt, a computer science researcher at ETH Zurich, SLMs could also offer new, fascinating insights into a longstanding scientific question: how children acquire their first language. Warstadt, alongside a group of researchers including Northeastern’s Mueller, organizes BabyLM, a challenge in which participants optimize language-model training on small data.

Not only could SLMs potentially unlock new secrets of human cognition, but they also help improve generative AI. By the time children turn 13, they’re exposed to about 100 million words and are better than chatbots at language, with access to only 0.01 percent of the data. While no one knows what makes humans so much more efficient, says Warstadt, “reverse engineering efficient humanlike learning at small scales could lead to huge improvements when scaled up to LLM scales.”


Meta released its own study with tips on how to optimise small language models: Liu et al. (2024)

Liu, Z., Zhao, C., Iandola, F., Lai, C., Tian, Y., Fedorov, I., Xiong, Y., Chang, E., Shi, Y., Krishnamoorthi, R., Lai, L., & Chandra, V. (2024). MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases (arXiv:2402.14905). arXiv. http://arxiv.org/abs/2402.14905
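Among the paper’s recommendations for sub-billion-parameter models: prefer deep-and-thin architectures over wide-and-shallow ones, share the input and output embedding matrices, and use grouped-query attention. Here is a rough back-of-the-envelope sketch of the parameter arithmetic behind those choices; the layer counts and dimensions are illustrative assumptions, not the paper’s exact configurations.

```c
/* Rough parameter-count arithmetic for a sub-billion decoder-only
 * transformer, illustrating two MobileLLM-style levers: embedding
 * sharing and grouped-query attention (GQA). Dimensions below are
 * illustrative assumptions, not the paper's exact configurations. */
#include <stdio.h>

/* Parameters in one transformer block with GQA attention and a
 * SwiGLU feed-forward network of hidden size ffn_dim. */
static long long block_params(long long dim, long long n_heads,
                              long long n_kv_heads, long long ffn_dim) {
    long long head_dim = dim / n_heads;
    long long attn = dim * dim                          /* Q projection */
                   + 2 * dim * (n_kv_heads * head_dim)  /* K and V      */
                   + dim * dim;                         /* output proj  */
    long long ffn = 3 * dim * ffn_dim;                  /* gate, up, down */
    return attn + ffn;
}

static long long model_params(long long vocab, long long dim,
                              long long n_layers, long long n_heads,
                              long long n_kv_heads, long long ffn_dim,
                              int share_embeddings) {
    long long emb = vocab * dim * (share_embeddings ? 1 : 2);
    return emb + n_layers * block_params(dim, n_heads, n_kv_heads, ffn_dim);
}

int main(void) {
    long long vocab = 32000;

    /* Wide and shallow: untied embeddings, full multi-head attention. */
    long long wide = model_params(vocab, 768, 12, 12, 12, 2048, 0);

    /* Deep and thin: tied embeddings, grouped-query attention. */
    long long deep = model_params(vocab, 576, 30, 9, 3, 1536, 1);

    printf("wide/shallow: %lld params (%.1f M)\n", wide, wide / 1e6);
    printf("deep/thin:    %lld params (%.1f M)\n", deep, deep / 1e6);
    return 0;
}
```

With a comparable overall budget (~134M vs ~125M parameters here), the deep-and-thin, embedding-shared configuration spends far less of it on the embedding table and far more on transformer blocks, which is roughly the shift the paper argues pays off at this scale.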

Small Machines

TinyLlama on an ESP32

The “Large” Language Model used is actually quite small. It is a 260K-parameter tinyllamas checkpoint trained on the TinyStories dataset. The LLM implementation uses llama2.c with minor optimizations to make it run faster on the ESP32.

Even this small one still requires 1MB of RAM. I used the ESP32-S3FH4R2 because it has 2MB of embedded PSRAM.
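That 1MB figure matches the arithmetic: llama2.c stores weights as 32-bit floats, so 260K parameters already come to roughly 1 MB before any per-token state. A quick sketch of the budget follows; the model shape used for the KV-cache estimate is an illustrative assumption, not the checkpoint’s published configuration.

```c
/* Back-of-the-envelope RAM budget for running a llama2.c-style
 * checkpoint on an ESP32-S3. Only the 260K parameter count comes
 * from the project above; the shape used for the KV-cache estimate
 * is an illustrative assumption. */
#include <stdio.h>

int main(void) {
    const long n_params   = 260000; /* tinyllamas 260K checkpoint             */
    const long fp32_bytes = 4;      /* llama2.c keeps weights in 32-bit float */

    /* Assumed shape for the cache estimate (hypothetical values). */
    const long n_layers = 5;
    const long kv_dim   = 32;       /* n_kv_heads * head_dim */
    const long seq_len  = 512;      /* maximum context length */

    long weights  = n_params * fp32_bytes;                        /* ~1.0 MB  */
    long kv_cache = 2L * n_layers * seq_len * kv_dim * fp32_bytes; /* K and V */

    printf("weights : %ld bytes (%.2f MB)\n", weights,  weights  / 1048576.0);
    printf("KV cache: %ld bytes (%.2f MB)\n", kv_cache, kv_cache / 1048576.0);
    printf("total   : %ld bytes of the 2 MB PSRAM\n", weights + kv_cache);
    return 0;
}
```

However the per-token state is sized, the weights alone nearly fill 1 MB, which is why the 2 MB of embedded PSRAM on the ESP32-S3FH4R2 is the deciding factor.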

See Also

SmolLM: a family of state-of-the-art small models with 135M, 360M, and 1.7B parameters, trained on a new high-quality dataset.

See more at OpenAI History and Principles

References

Liu, Zechun, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, et al. 2024. “MobileLLM: Optimizing Sub-Billion Parameter Language Models for On-Device Use Cases.” arXiv. http://arxiv.org/abs/2402.14905.